{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 基础理论部分"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 0. Can you come up out 3 sceneraies which use AI methods? "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ans: 火车站的人脸识别；手机上的语音助手；特斯拉的自动驾驶"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1. How do we use Github; Why do we use Jupyter and Pycharm;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ans: GitHub传道授业解惑，提交作业；Jupyter做学习笔记，方便解析整个过程，而Pycharm是IDE，用于开发项目"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2. What's the Probability Model?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ans: 用来计算输入出现在总体样本的概率"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3. Can you came up with some sceneraies at which we could use Probability Model?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ans: 输入法的推荐后面的字"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4. Why do we use probability and what's the difficult points for programming based on parsing and pattern match?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ans: 用来判断事物的可能性或者预测事情的走向；样本需要足够大并保证一定的正确性"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 5. What's the Language Model;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ans: 按照给定的pattern生成句子，判断哪个句子更为真实"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 6. Can you came up with some sceneraies at which we could use Language Model?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ans: 语音识别；机器翻译"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 7. What's the 1-gram language model;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ans: 单个词出现的概率"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 8. What's the disadvantages and advantages of 1-gram language model;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ans: Dis——无法关联性预测；Adv——可提供大量基础数据"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 9. What't the 2-gram models;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ans: 用来根据前1个item来预测后一个item"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 编程实践部分"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1. 设计你自己的句子生成器"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 169,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "import re\n",
    "import jieba\n",
    "from collections import Counter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 170,
   "metadata": {},
   "outputs": [],
   "source": [
    "sentence = '''\n",
    "sentence = people verb noun\n",
    "people = Alice | Beyond | Trump\n",
    "verb = do | doing\n",
    "do = plays | sings\n",
    "doing = is act\n",
    "act = playing | singing\n",
    "noun = basketball | song | music\n",
    "'''"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 171,
   "metadata": {},
   "outputs": [],
   "source": [
    "ans = '''\n",
    "ans = a verb b noun\n",
    "a = you | he | she | I\n",
    "verb = turn | ask | find\n",
    "b = dracula | fairy\n",
    "noun = 1 inter\n",
    "1 = for | to\n",
    "inter = help | advice | suggestion\n",
    "'''"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 172,
   "metadata": {},
   "outputs": [],
   "source": [
    "def grammar(gram_str, split='='):\n",
    "    grammar = {}\n",
    "    for line in gram_str.split('\\n'):\n",
    "        if not line.strip(): continue\n",
    "        l, r = line.split(split)\n",
    "        grammar[l.strip()] = [s.split() for s in r.split('|')]\n",
    "    return grammar"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 215,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate(gram, target):\n",
    "    if target not in gram: return target\n",
    "    expanded = [generate(gram, t) for t in random.choice(gram[target])]\n",
    "    #return ' '.join(x for x in expanded)\n",
    "    return ''.join(x for x in expanded)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 174,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Trump plays song'"
      ]
     },
     "execution_count": 174,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "generate(grammar(sentence), 'sentence')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 175,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'he turn fairy for suggestion'"
      ]
     },
     "execution_count": 175,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "generate(grammar(ans), 'ans')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 206,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_n(gram, target, n):\n",
    "    result = []\n",
    "    for i in range(n + 1):\n",
    "        inter = generate(grammar(gram), target)\n",
    "        result.append(inter)\n",
    "    return result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 208,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['she find dracula for advice',\n",
       " 'he find fairy to suggestion',\n",
       " 'I find fairy to advice',\n",
       " 'I turn fairy to suggestion',\n",
       " 'you find fairy to suggestion',\n",
       " 'I ask dracula to advice',\n",
       " 'you find dracula for suggestion',\n",
       " 'she find fairy to advice',\n",
       " 'he ask fairy to advice',\n",
       " 'she turn dracula to suggestion',\n",
       " 'she find dracula to suggestion',\n",
       " 'she turn fairy for advice',\n",
       " 'he find dracula to advice',\n",
       " 'I ask fairy to suggestion',\n",
       " 'you turn fairy for help',\n",
       " 'you turn fairy to advice',\n",
       " 'I find dracula to suggestion',\n",
       " 'she turn fairy to suggestion',\n",
       " 'I ask fairy for advice',\n",
       " 'he ask dracula to help',\n",
       " 'she find dracula for advice']"
      ]
     },
     "execution_count": 208,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a = generate_n(ans, 'ans', 20)\n",
    "a"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2. 使用新数据源完成语言模型的训练"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 177,
   "metadata": {},
   "outputs": [],
   "source": [
    "filename = 'D:\\\\GitHub\\\\NLP\\\\data\\\\train.txt'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 178,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "with open(filename, encoding='utf-8') as f:\n",
    "    content = f.read()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 179,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1343520"
      ]
     },
     "execution_count": 179,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(content)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 180,
   "metadata": {},
   "outputs": [],
   "source": [
    "def token(string):\n",
    "    return re.findall('\\w+', string)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 181,
   "metadata": {},
   "outputs": [],
   "source": [
    "cut = Counter(jieba.cut(content))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 182,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(' ', 241272),\n",
       " ('++', 77334),\n",
       " ('$', 38679),\n",
       " ('-', 13317),\n",
       " ('\\n', 12889),\n",
       " ('?', 12876),\n",
       " ('？', 12736),\n",
       " ('insurance', 11842),\n",
       " ('Insurance', 10011)]"
      ]
     },
     "execution_count": 182,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cut.most_common()[:9]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 183,
   "metadata": {},
   "outputs": [],
   "source": [
    "content_clean = [''.join(token(content))]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 184,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('train_clean.txt', 'w') as f:\n",
    "    for a in content_clean:\n",
    "        f.write(a + '\\n')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 185,
   "metadata": {},
   "outputs": [],
   "source": [
    "def cut(string):\n",
    "    return list(jieba.cut(string))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 186,
   "metadata": {},
   "outputs": [],
   "source": [
    "TOKEN = []"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 187,
   "metadata": {},
   "outputs": [],
   "source": [
    "for i, line in enumerate(open('train_clean.txt')):\n",
    "    TOKEN += cut(line)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 188,
   "metadata": {},
   "outputs": [],
   "source": [
    "TOKEN = [str(t) for t in TOKEN]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 189,
   "metadata": {},
   "outputs": [],
   "source": [
    "TOKEN_2_GRAM = [''.join(TOKEN[i:i + 2]) for i in range(len(TOKEN[:-2]))]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 190,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "87966"
      ]
     },
     "execution_count": 190,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(TOKEN_2_GRAM)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 191,
   "metadata": {},
   "outputs": [],
   "source": [
    "words_count = Counter(TOKEN_2_GRAM)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 193,
   "metadata": {},
   "outputs": [],
   "source": [
    "def prob_2(word1, word2):\n",
    "    if word1 + word2 in words_count: return words_count[word1 + word2] / len(TOKEN_2_GRAM)\n",
    "    else:\n",
    "        return 1 / len(TOKEN_2_GRAM)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 194,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2.273605711297547e-05"
      ]
     },
     "execution_count": 194,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "prob_2('买','保险')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 195,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.1368028556487734e-05"
      ]
     },
     "execution_count": 195,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "prob_2('卖','糖')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 196,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.1368028556487734e-05"
      ]
     },
     "execution_count": 196,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "prob_2('去','游玩')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 197,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.013061864811404407"
      ]
     },
     "execution_count": 197,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "prob_2('什么','是')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3. 获得最优质的的语言"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "当我们能够生成随机的语言并且能判断之后，我们就可以生成更加合理的语言了。请定义 generate_best 函数，该函数输入一个语法 + 语言模型，能够生成**n**个句子，并能选择一个最合理的句子: \n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 198,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[1, 2, 3, 5]"
      ]
     },
     "execution_count": 198,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sorted([1, 3, 5, 2])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这个函数接受一个参数key，这个参数接受一个函数作为输入，例如"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 199,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(1, 4), (2, 5), (4, 4), (5, 0)]"
      ]
     },
     "execution_count": 199,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sorted([(2, 5), (1, 4), (5, 0), (4, 4)], key=lambda x: x[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "能够让list按照第0个元素进行排序."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 200,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(5, 0), (1, 4), (4, 4), (2, 5)]"
      ]
     },
     "execution_count": 200,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sorted([(2, 5), (1, 4), (5, 0), (4, 4)], key=lambda x: x[1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "能够让list按照第1个元素进行排序."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 201,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(2, 5), (1, 4), (4, 4), (5, 0)]"
      ]
     },
     "execution_count": 201,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sorted([(2, 5), (1, 4), (5, 0), (4, 4)], key=lambda x: x[1], reverse=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "能够让list按照第1个元素进行排序, 但是是递减的顺序。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 209,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_probablity(sentence):\n",
    "    words = cut(sentence)   \n",
    "    sentence_pro = 1    \n",
    "    for i, word in enumerate(words[:-1]):\n",
    "        next_ = words[i + 1]\n",
    "        probability = prob_2(word, next_)        \n",
    "        sentence_pro *= probability    \n",
    "    return sentence_pro"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 220,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_best(): # you code here\n",
    "    sentence = '''\n",
    "    sentence = a verb noun\n",
    "    a = 医疗保险 | 债权人 | 费用\n",
    "    verb = 有 | 是 | 涵盖 \n",
    "    noun = 什么 | 时间 | 特点 | 多少\n",
    "    '''\n",
    "    result = generate_n(sentence, 'sentence', 6)\n",
    "    pro_list = []\n",
    "    for i in result:\n",
    "        pro = get_probablity(i)\n",
    "        pro_list.append((pro, i))\n",
    "    pro_list = sorted(pro_list, key=lambda x: x[0], reverse=True)\n",
    "    return pro_list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 221,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(8.238544670396437e-07, '医疗保险是什么'),\n",
       " (2.326177318700171e-09, '医疗保险涵盖特点'),\n",
       " (3.876962197833618e-10, '费用涵盖多少'),\n",
       " (1.292320732611206e-10, '费用涵盖时间'),\n",
       " (1.292320732611206e-10, '费用涵盖特点'),\n",
       " (1.292320732611206e-10, '债权人有时间'),\n",
       " (1.292320732611206e-10, '债权人涵盖时间')]"
      ]
     },
     "execution_count": 221,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "generate_best()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Q: 这个模型有什么问题？ 你准备如何提升？ "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ans: 混有中英文，需要将英文和中文处理下"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
